Yushi Tang
March 27/29, 2019
A simple method would be to assign the nearest gene or the gene containing peaks in its promoter region
But ChIP-seq experiments are often examined in the context of gene expression, and thus expression profiles are available for both the factor-bound and factor-unbound conditions
So how can this information be incorporated with the peaks to improve inference about target genes?
Figure 1
Use LIMMA to detect up-/down-DE genes (CSV file)
Binding events/peaks called from MACS (BED file)
By category (up,down,unchanged), rank the genes by decreasing score
Calculate cumulative distribution function of the gene groups
Use a one-tailed Kolmogorov-Smirnov test to determine whether the groups differ significantly from the unchanged group
Figure 2A
Figure 3A
Figure 3B
Sort and assign ranks to peaks based on regulatory potential
Sort and assign ranks to genes based on differential expression
Compute rank product (interpreted as a p-value).
Results are output as a table. Threshold based on score/p-value for confident target genes
BETA conducts motif analysis on sites proximal to the targets to identify factor-binding motifs
Compares the number of motifs near the ChIP-seq binding summits with that in flanking regions to detect motifs with marked summit enrichment
Results summarized in a web page
Figure 2B
Figure 2C
Figure 4
Gene ontology of gene list (DAVID, separate by up-/down-regulated)
Motif analysis
Co-factor analysis (Intersect the BETA list with the original DE list)
BETA-basic, BETA-plus and BETA-minus
BETA-basic can be used to predict whether a factor has activating or repressive function and detect direct target genes
BETA-plus can be used to predict whether a factor has activating or repressive function, whether it can detect direct target genes and whether it can analyze sequence motifs in target regions
BETA-minus is used when only binding data are available to predict target genes based on distance
Both binding and differential expression data are required for BETA-basic and BETA-plus
Tang, Y. (2018). GSP Freeze1 Data Analysis – Data Annotation and Quality Control. Paper presented at Broad Institute and Department of Biostatistics, Harvard University, MA.
The main topic for this experience is ChIP-seq, motif finding, and expression integration. Androgen receptor (AR) is a transcription factor frequently over-activated in prostate cancer. To study AR regulation in prostate cancer, scientists conducted AR ChIP-seq in prostate tumors and normal prostate tissues. Since the difference between individual patients could be quite big, this study actually included many more tumor and normal samples. However, for the purpose of this HW, we will only use the ChIP-seq data from 1 prostate tumor samples (tumor) and 1 normal prostate tissues (normal).
Hint: 1). All data needed for this HW are stored at /n/stat115/HW4_2019 on the Odyssey. 2). It would be helpful to read the MACS README and Nature Protocol paper:
https://pypi.python.org/pypi/MACS2/2.0.10.09132012
http://liulab.dfci.harvard.edu/publications/NatProtoc12_1728.pdf
Usually we use BWA to map the reads to the genome for ChIP-seq experiment. We will give you one example ChIP-seq single-end sequenced .fastq file with only 1M reads. Run BWA on this file to Hg38 of the human genome assembly. Report the commands, logs files, and a snapshot / screenshot of the output to demonstrate your alignment procedure. What proportion of the reads are successfully mapped (to find at least one location) and what proportions are uniquely mapped (to find a single location) in the human genome in this test sample? We will save you some time and directly give you the BWA mapped BAM files for the sample.
Hint: 1). Target sample fastq file is stored as /n/stat115/HW4_2019/tumor_1M.fastq on the Odyssey 2). The index file is stored as /n/stat115/HW2_2019/bwa_hg38_index/hg38.fasta on the Odyssey
# your shebang
module load bwa/0.7.15-fasrc02
bwa mem /path/to/index/fasta /path/to/input/data > /path/to/output/file/your_output_name.sam# samtools might be useful to acquire the summary statistics
# of course you have to load the samtools module first
$ samtools flagstat bwa.sam
$ samtools view -bq 1 bwa.sam > unique.bam
$ samtools flagstat unique.bamIn ChIP-Seq experiments, when sequencing library preparation involves a PCR amplification step, it is common to observe multiple reads where identical nucleotide sequences are disproportionally represented in the final results. This is especially a problem in tissue ChIP-seq experiments (as compared to cell lines) when input cell numbers are low. Removing these duplicated reads can improve the peak calling accuracy. Thus, it may be necessary to perform a duplicate read removal step, which flags identical reads and subsequently removes them from the dataset. Run this on your test sample (1M reads) (macs2 filterdup). What % of reads are redundant? When doing peak calling, MACS filters duplicated reads by default.
Hint: The test samples are stored as /n/stat115/HW4_2019/tumor.bam and /n/stat115/HW4_2019/normal.bam on the Odyssey.
# your shebang
module load
macs2 filterdup -i /path/to/input/bam/file -g hs --keep-dup 1 -o ./path/to/output/bed/file/your_output_name.bedFor many ChIP-seq experiments, usually chromatin input without enriching for the factor of interest is often generated as control. However, in this experiment, we only have ChIP and no control samples. Without control, MACS2 will use the signals around the peaks to infer the chromatin background and estimate the ChIP enrichment over background. What is the estimated fragment size in each? Use MACS2 to call peaks from tumor1 and normal1 separately. How many peaks do you get from each condition with FDR < 0.05 and fold change > 5?
Call peak for normal sample
# your shebang
module load centos6/0.0.1-fasrc01
module load macs2/2.1.2_dev-fasrc01
macs2 callpeak -t /path/to/your/input/sample/bed/file.bed -f AUTO -g 2.7e9 -q 0.05 -m 6 50 --outdir path/to/save/your/output/ -n prefix_of_your_output-t/--treatment filename
-c/--control
-n/--output name
-f/--format of tag files
--outdir/--the folder where all the output files saved into
-n/--name of the output as NAME_peaks.bed
-g/--gsize The default hs -- 2.7e9 is recommended as_for UCSC human hg18 assembly
-q/--qvalue (minimum FDR) cutoff to call significant regions. Default is 0.05.Repeat this for the tumor sample
Now we want to see whether AR has differential binding sites between prostate tumors and normal prostates. MACS2 does have a function to call differential peaks between conditions, but requires both conditions to have input control. Since we don’t have input controls for these AR ChIP-seq, we will just run the AR tumor ChIP-seq over the AR normal ChIP-seq (pretend the latter to be input control) to find differential peaks. How many peaks do you get with FDR < 0.01 and fold change > 6?
# your shebang
module load centos6/0.0.1-fasrc01
module load macs2/2.1.2_dev-fasrc01
macs2 callpeak -t path/to/your/treat.bed -c path/to/your/control.bed -f AUTO -g 2.7e9 -q 0.01 --fe-cutoff 6 --outdir path/to/your/output/folder/ -n prefix_of_your_outputCistrome Data Browser (http://cistrome.org/db/) has collected and pre-processed most of the published ChIP-seq data in the public. Play with Cistrome DB. Biological sources indicate whether the ChIP-seq is generated from a cell line (e.g. VCaP, LNCaP, PC3, C4-2) or a tissue (Prostate). Are there over 10 AR ChIP-seq data available in human prostate tissues?
Doing transcription factor ChIP-seq in tissues could be a tricky experiment, so sometimes even published studies have very bad data. Look at a few AR ChIP-seq samples in the prostate tissue on Cistrome and inspect their QC reports. Can you comment on what QC measures tell you whether a ChIP-seq is of good or bad quality. Include a screen shot of a good AR ChIP-seq vs a bad AR ChIP-seq.
Antibody is one important factor influencing the quality of a ChIP-seq experiment. Click on the GEO (GSM) ID of some good quality vs bad quality ChIP-seq data, and see where they got their AR antibodies. If you plan to do an AR ChIP-seq experiment, which company and catalog # would you use to order the AR antibody?
We want to see in prostate tumors, which other transcription factors (TF) might be collaborating with AR. Try any of the following motif finding tools to find TF motifs enriched in the differential AR peaks you identified above. Did you find the known AR motif, and motifs of other factors that might interact with AR in prostate cancer in gene regulation? Describe the tool you used, what you did, and what you found. Note that finding the correct AR motif is usually an important criterion for AR ChIP-seq QC.
Cistrome: http://cistrome.org/ap/root (Register a free account).
Weeder: http://159.149.160.88/pscan_chip_dev/
Look at the AR binding distribution in Cistrome DB from a few good AR ChIP-seq data in prostate. Does AR bind mostly in the gene promoters, exons, introns, or intergenic regions? Also, look at the QC motifs to see what motifs are enriched in the ChIP-seq peaks. Do you see similar motifs here as those you found in your motif analyses?
Sometimes members of the same transcription factor family (e.g. STAT1, 2, 3, 4, 5, 6) have similar binding motifs, similar binding sites (when they are expressed, although they might be expressed in very different tissues), and related functions. Therefore, to confirm that we have found the correct TFs interacting with AR in prostate tumors, in addition to looking for motifs enriched in the AR ChIP-seq, we also want to see whether the TFs are highly expressed in prostate tumor. For this, we will use TIMER (https://cistrome.shinyapps.io/timer/). First, try the “Diff Exp” tab. Check the top motifs you found before, and see which member of the TF family that recognizes the motif is highly expressed in prostate tissues or tumors. Another way is to see whether the TF family member and AR have correlated expression pattern in prostate tumors. Go to the “Correlation” tab, select prostate cancer (PRAD), enter AR and the gene you are interested in, correct the correlation by tumor purity, and see whether the candidate TF is correlated with AR in prostate tumors. Based on the motif and expression evidences, which factors are the most likely collaborator of AR in prostate cancer?
Note: When we conduct RNA-seq on prostate tumors, each tumor might contain cancer cells, normal prostate epithelia cells, stromal fibroblasts, and other immune cells. Therefore, genes that are highly expressed in cancer cells (including AR) could be correlated in different tumors simply due to the tumor purity bias. Therefore, when looking for genes correlated with AR just in the prostate cancer cells, we should correct this tumor purity bias.
Solution:
See which member of the TF family that recognizes the motif is highly expressed in prostate tissues or tumors
See whether the TF family member and AR have correlated expression pattern in prostate tumors
Besides looking for motif enrichment, another way to find TFs that might interact with AR is to see whether there are other TF ChIP-seq data which have significant overlap with AR ChIP-seq. Take the differential AR ChIP-seq peaks (in .bed format) that are significantly higher in tumor than normal, and run this on the Cistrome Toolkit (http://dbtoolkit.cistrome.org/). The third function in Cistrome Toolkit looks through tens of thousands of published ChIP-seq data to see whether any have significant overlap with your peak list. You should see AR enriched in the results (since your input is a list of AR ChIP-seq peaks after all). What other factors did you see enriched? Do they agree with your motif analyses before?
Solution:
Detect collaborating TFs
Now we try to see what target genes these AR binding sites regulate. Among the differentially expressed genes in prostate cancer, only a subset might be directly regulated by AR binding. One simple way of getting the AR target genes is to look at which genes have AR binding in its promoters. Write a python program that takes two input files: 1) the AR differential ChIP-seq peaks in tumor over normal; 2) refGene annotation. The program outputs to a file containing genes that have AR ChIP-seq peak (in this case, stronger peak in tumor) within 3KB + / - from the transcription start site (TSS) of the gene. How many putative AR target genes in prostate cancer do you get using this approach?
Note: From UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/), download the human RefSeq annotation table (find the file refGene.txt.gz for Hg38). To understand the columns in this file, check the query annotation at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.sql.
Hint: TSS is different for genes on positive or negative strand, i.e. TSS is “txStart” for genes on the positive strand, “txEnd” for genes in negative strand. When testing your python code, try smaller number of gene annotations or smaller number of peaks to check your results before moving forward.
# Prepare the reference .bed file
TSS <- read.table('./data/refGene.txt')
head(TSS)## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 585 NR_046018 chr1 + 11873 14409 14409 14409 3
## 2 585 NR_024540 chr1 - 14361 29370 29370 29370 11
## 3 585 NR_106918 chr1 - 17368 17436 17436 17436 1
## 4 585 NR_107062 chr1 - 17368 17436 17436 17436 1
## 5 585 NR_107063 chr1 - 17368 17436 17436 17436 1
## 6 585 NR_128720 chr1 - 17368 17436 17436 17436 1
## V10
## 1 11873,12612,13220,
## 2 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320,
## 3 17368,
## 4 17368,
## 5 17368,
## 6 17368,
## V11 V12
## 1 12227,12721,14409, 0
## 2 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, 0
## 3 17436, 0
## 4 17436, 0
## 5 17436, 0
## 6 17436, 0
## V13 V14 V15 V16
## 1 DDX11L1 unk unk -1,-1,-1,
## 2 WASH7P unk unk -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
## 3 MIR6859-1 unk unk -1,
## 4 MIR6859-2 unk unk -1,
## 5 MIR6859-3 unk unk -1,
## 6 MIR6859-4 unk unk -1,
TSS <- TSS[,c(3,5,6,4,2)]
colnames(TSS) <- c('chr','start','end','strand','id')
TSS_positive_strand <- subset(TSS,TSS$strand=='+')
TSS_negative_strand <- subset(TSS,TSS$strand=='-')
head(TSS_positive_strand)## chr start end strand id
## 1 chr1 11873 14409 + NR_046018
## 7 chr1 30365 30503 + NR_036051
## 8 chr1 30365 30503 + NR_036266
## 9 chr1 30365 30503 + NR_036267
## 10 chr1 30365 30503 + NR_036268
## 14 chr1 69090 70008 + NM_001005484
TSS_positive_strand$end <- TSS_positive_strand$start
head(TSS_positive_strand)## chr start end strand id
## 1 chr1 11873 11873 + NR_046018
## 7 chr1 30365 30365 + NR_036051
## 8 chr1 30365 30365 + NR_036266
## 9 chr1 30365 30365 + NR_036267
## 10 chr1 30365 30365 + NR_036268
## 14 chr1 69090 69090 + NM_001005484
head(TSS_negative_strand)## chr start end strand id
## 2 chr1 14361 29370 - NR_024540
## 3 chr1 17368 17436 - NR_106918
## 4 chr1 17368 17436 - NR_107062
## 5 chr1 17368 17436 - NR_107063
## 6 chr1 17368 17436 - NR_128720
## 11 chr1 34610 36081 - NR_026818
TSS_negative_strand$start <- TSS_negative_strand$end
head(TSS_negative_strand)## chr start end strand id
## 2 chr1 29370 29370 - NR_024540
## 3 chr1 17436 17436 - NR_106918
## 4 chr1 17436 17436 - NR_107062
## 5 chr1 17436 17436 - NR_107063
## 6 chr1 17436 17436 - NR_128720
## 11 chr1 36081 36081 - NR_026818
TSS_new <- rbind(TSS_positive_strand,TSS_negative_strand)
TSS_new$end <- TSS_new$start+1
write.table(TSS_new,"./data/TSS_HG38.bed", sep = '\t',
quote=FALSE,row.names=FALSE, col.names=FALSE)# Please remember to apply for an interactive session
srun --pty -p test -t 0-5:00 --mem 10000 /bin/bash
# Load required modules
module load centos6/0.0.1-fasrc01
module load bedtools/2.17.0-fasrc01
# Detect putative AR target genes
bedtools window -w 3000 -u -a ../work/TSS_HG38.bed -b ../work/HW4Q3_tumor_over_normal_summits.bed > ../work/putative_AR_target_genes.bed
# Count the number of putative AR target genes
wc -l ../work/putative_AR_target_genes.bedNow overlap the putative AR target genes you get from above with up regulated genes in prostate cancer(up_regulated_genes_in_prostate_cancer.txt). Try to run DAVID on 1) the AR target genes from binding alone and 2) the AR target genes by overlapping AR binding with differential expression. Are there enriched GO terms or pathways?
putative <- read.table('./data/putative_AR_target_genes.txt')
colnames(putative) <- c('chr','start','end','id','strand')
nrow(putative)
diff_exp <- read.table('./data/up_regulated_genes_in_prostate_cancer.txt')
colnames(diff_exp) <- c('id')
nrow(diff_exp)
overlap <- merge(putative, diff_exp, by = 'id')
nrow(overlap)Another way of getting the AR target genes is to consider the number of AR binding sites within 100KB of TSS, but weight each binding site by an exponential decay of its distance to the gene TSS (i.e. peaks closer to TSS have higher weights). For this, we have calculated regulatory potential score for each refseq gene(AR_peaks_regulatory_potential.txt). Select the top 1500 genes with highest regulatory potential score, try to run DAVID both with and without differentially expression, and see the enriched GO terms.
Note: Basically this regulatory potential approach assumes that there are stronger AR targets (e.g. those genes with many AR binding sites within 100KB and have stronger differential expression) and weaker AR targets, instead of a binary Yes / No AR targets.
Comment on the AR targets you get from promoter binding (your code) and distance weighted binding. Which one gives you better function / pathway enrichment? Does considering differential expression help?
Please submit your solution directly on the canvas website. Please provide both your code in this Rmd document and an html file for your final write-up. Please pay attention to the clarity and cleanness of your homework.
The teaching fellows will grade your homework and give the grades with feedback through canvas within one week after the due date. Some of the questions might not have a unique or optimal solution. TFs will grade those according to your creativity and effort on exploration, especially in the graduate-level questions.